1 Executive Summary

The present report investigated a set of univariate time series datasets from the UEA & UCR Time Series Classification Repository. A subset of datasets were webscraped and processed using a data wrangling pipeline. One specific dataset - SonyAIBORobotSurface1 - was selected for a deep dive. The dataset comprises measurements of a Sony AIBO Robot moving across either carpet or cement, and the present analysis sought to understand the structure of this data. A feature-based time-series analysis approach was taken, which revealed an interesting structure and strong potential for features to be able to classify unseen data. An SVM was fit and produced an 86% classification accuracy of the surface under the robot being either carpet or cement. This is practically significant, as this algorithm could be used in pseudo-real-time applications either remotely or within robot electronics to predict terrain type, risk, and other information of interest with an acceptably high accuracy.

2 Exploring the Dataset

Time series is a ubiquitous form of data across business and the sciences. A time series is a temporally-ordered sequence of values, typically taken at evenly-spaced points in time. Analysis of time series can typically be summarised as four distinct types:

  • Decomposition
  • Forecasting
  • Classification
  • Empirical structure

Recent empirical research has sought to use the ideas of finding empirical structure in data as a means to classify time series. This approach has meaningful practical significance as data-driven classification tasks become even more prominent in the rapidly growing availability of temporal data. Examples include classifying schizophrenic from control using functional magnetic resonance imaging, star type using light-curve readings, and customer segmentation based on clicks on a webpage. Stakeholders for this type of work are far reaching, including customers, patients, medical practitioners, researchers, and C-suite executives, among many others.

This report details a pipeline for webscraping, tidying, wrangling, and visualising a subset of general time series classification datasets. It then builds on this work by using a novel approach to time-series analysis to understand any empirical structure in the data, ascertain the presence of any relationships in a low dimensional space, and train a classification algorithm that can distinguish between two classes accurately on unseen data. The novelty and reproducibility of the analytical workflow presented in this report is likely of as much interest to stakeholders as the classification algorithm output itself. This is because the data cleaning, processing, and analysis workflow has been written in such as way that it can be generalised easily to many other time-series datasets. Learnings and implications for practice are also discussed.

library(data.table)
library(dplyr)
library(magrittr)
library(tidyr)
library(ggplot2)
library(scales)
library(tibble)
library(tidytext)
library(foreign)
library(Rcatch22)
library(theft) # Not quite on CRAN yet - devtools::install_github("hendersontrent/theft")
library(e1071)
library(caret)

2.1 Data Provenance

The UEA & UCR Time Series Classification Repository is a website which contains a large set of univariate and multivariate time-series datasets that are used for benchmarking and testing of classification algorithms. The repository is maintained by prominent time series authors across two universities. The repository currently includes 128 univariate and 30 multivariate time-series datasets. These datasets included in the repository are considered the gold standard for which researchers and analysts must benchmark their novel approaches against. Many papers have been presented on algorithms fit to these datasets, such as the popular The Great Time Series Classification Bake-Off. While questions regarding the individual datasets themselves that comprise the repository could be raised, the scope of this is far beyond that of the present report. However, given the prominence of the repository as a standardised benchmarking library in the time series field, it can likely be taken that the reliability and replicability of the measurements contained within is high.

2.2 Domain knowledge

All data contained and reported within this documented was sourced directly from the repository webpage using a simple webscraper that was written to only extract the univariate two-class problems. There are no concerns with privacy or ethics for this data as the repository openly enables (and encourages) downloads. However, some domain knowledge of the format the data is stored in is needed for automated downloading and parsing. The author spent time manually inspecting the structure of a sample of the target datasets prior to writing the webscraper to ensure it would work in the most automated way possible. Some prior knowledge of indicative time series dataset structure is needed to perform this inspection. The webscraper was written simply to programmatically download and extract only the two-class problems, rather than downloading the entire univariate library from the start - which is a much larger file. This download is stored in a temporary file which is deleted upon closure of the R session to save local disk memory. However, the user can specify to save the cleaned and parsed version of the data, which is more memory efficient than storing both the initial zip file and the processed data. While the number and range of datasets contained in the repository spans a variety, the datasets do not cover all fields of science. Importantly, the datasets do not include any synthetic (simulated) data. This means any inferences drawn from the applied datasets contained within may not generalise to synthetic data which are often used for algorithm development.

#' Function to automatically webscrape and parse Time Series Classification univariate two-class classification datasets
#' 
#' NOTE: The dictionary list used to identify and pass two-class problems only should be switched to a dynamic
#' webscrape table read to ensure it can scale as the dataset structure changes/is added to.
#' 
#' @return a dataframe object in tidy form
#' @author Trent Henderson
#' 

pullTSCprobs <- function(){
  
  # --------------- Set up dictionary -------------
  
  # Not all the datasets are two-class problems. Define dictionary from
  # website of two-class problems to filter downloaded dataset by
  # Source: http://www.timeseriesclassification.com/dataset.php
  
  twoclassprobs <- c("Yoga", "WormsTwoClass", "Wine", 
                     "Wafer", "TwoLeadECG", "ToeSegmentation2", 
                     "ToeSegmentation1", "Strawberry", "SonyAIBORobotSurface2", 
                     "SonyAIBORobotSurface1", "SharePriceIncrease", "ShapeletSim", 
                     "SemgHandGenderCh2", "SelfRegulationSCP2", "SelfRegulationSCP1", 
                     "RightWhaleCalls", "ProximalPhalanxOutlineCorrect", "PowerCons",
                     "PhalangesOutlinesCorrect", "MotorImagery", "MoteStrain", 
                     "MiddlePhalanxOutlineCorrect", "Lightning2", "ItalyPowerDemand", 
                     "HouseTwenty", "Herring", "HandOutlines", "Ham", "GunPointOldVersusYoung", 
                     "GunPointMaleVersusFemale", "GunPointAgeSpan", "GunPoint", 
                     "FreezerSmallTrain", "FreezerRegularTrain", "FordB",
                     "FordA", "ECGFiveDays", "ECG200", "Earthquakes", "DodgerLoopWeekend", 
                     "DodgerLoopGame", "DistalPhalanxOutlineCorrect", "Computers", 
                     "Coffee", "Chinatown", "BirdChicken", "BeetleFly")
  
  # --------------- Webscrape the data ------------
  
  temp <- tempfile()
  download.file("http://www.timeseriesclassification.com/Downloads/Archives/Univariate2018_arff.zip", temp, mode = "wb")
  
  # --------------- Parse into problems -----------
  
  problemStorage <- list()
  message("Parsing individual datasets...")
  
  for(i in twoclassprobs){
    
    tryCatch({
      
      path <- paste0("Univariate_arff/",i,"/")
      
      # Retrieve TRAIN and TEST files
      
      train <- foreign::read.arff(unz(temp, paste0(path,i,"_TRAIN.arff"))) %>%
        mutate(id = row_number()) %>%
        mutate(set_split = "Train")
      
      themax <- max(train$id) # To add in test set to avoid duplicate IDs
      
      test <- foreign::read.arff(unz(temp, paste0(path,i,"_TEST.arff"))) %>%
        mutate(id = row_number()+themax) %>% # Adjust relative to train set to stop double-ups
        mutate(set_split = "Test")
      
      #----------------------------
      # Wrangle data to long format
      #----------------------------
      
      # Train
      
      thecolstr <- colnames(train)
      keepcolstr <- thecolstr[!thecolstr %in% c("target", "id", "set_split")]
      
      train2 <- train %>%
        mutate(problem = i) %>%
        pivot_longer(cols = all_of(keepcolstr), names_to = "timepoint", values_to = "values") %>%
        mutate(timepoint = as.numeric(gsub(".*?([0-9]+).*", "\\1", timepoint)))
      
      # Test
      
      thecolste <- colnames(test)
      keepcolste <- thecolste[!thecolste %in% c("target", "id", "set_split")]
      
      test2 <- test %>%
        mutate(problem = i) %>%
        pivot_longer(cols = all_of(keepcolste), names_to = "timepoint", values_to = "values") %>%
        mutate(timepoint = as.numeric(gsub(".*?([0-9]+).*", "\\1", timepoint)))
      
      #------
      # Merge
      #------
      
      tmp <- bind_rows(train2, test2)
      problemStorage[[i]] <- tmp
      
    }, error = function(e){cat("ERROR :",conditionMessage(e), "\n")})
  }
  
  problemStorage2 <- rbindlist(problemStorage, use.names = TRUE)
  return(problemStorage2)
}

allProbs <- pullTSCprobs()

2.3 Data structure

As this data was webscraped for ongoing purposes beyond this report, there are 42 different datasets available - corresponding to the different univariate time series classification problems available in the repository. A summary of some high-level descriptive properties of each is presented in the figure below.

p <- allProbs %>%
  group_by(problem) %>%
  summarise(tsLength = max(timepoint),
            trainSize = length(unique(id[set_split == "Train"])),
            testSize = length(unique(id[set_split == "Test"]))) %>%
  ungroup() %>%
  pivot_longer(cols = tsLength:testSize, names_to = "metric", values_to = "values") %>%
  mutate(metric = case_when(
    metric == "tsLength"  ~ "Time Series Length",
    metric == "trainSize" ~ "Train Set Size",
    metric == "testSize"  ~ "Test Set Size")) %>%
  group_by(metric) %>%
  mutate(ranker = dense_rank(values)) %>%
  ungroup() %>%
  mutate(metric = factor(metric, levels = c("Time Series Length", "Train Set Size", "Test Set Size"))) %>%
  ggplot(aes(x = tidytext::reorder_within(problem, -ranker, values, sep = "_"), y = values)) +
  geom_bar(stat = "identity", alpha = 0.9, fill = "#E494D3") +
  labs(x = "Time Series Problem",
       y = "Value") +
  scale_y_continuous(labels = comma) +
  facet_wrap(~metric, scales = "free") +
  theme(axis.text.x = element_text(size = 5, angle = 90))

print(p)
Descriptive statistics for each time-series dataset

Descriptive statistics for each time-series dataset

However, this report will focus on a single dataset - SonyAIBORobotSurface1. This dataset contains x-axis accelerometer recordings for the SONY AIBO Robot - a small quadruped dog-like robot - while it walked across either carpet or cement surfaces. The resulting data comprises a univariate two-class classification problem. Each time series represents one walk cycle for the robot. From a materials perspective, cement is much more solid than carpet, so one could intuit that the time series may show a higher level of sharpness in its changes. Depending on the composition of the robot, this may result in larger variance as the opposing forces of cement may produce more spring in robot movement. A visual check of the raw time series is useful to gauge any statistical peculiarities or issues with the import. A random sample of ten time series from SonyAIBORobotSurface1 - five from the train class and five from the test class - is presented in the figure below.

# Filter to just SonyAIBORobotSurface1

SonyAIBORobotSurface1 <- allProbs %>%
  filter(problem == "SonyAIBORobotSurface1")

# Randomly sample five IDs from each group

trainIDs <- SonyAIBORobotSurface1 %>%
  filter(set_split == "Train") %>%
  dplyr::select(c(id)) %>%
  distinct() %>%
  pull(id)

testIDs <- SonyAIBORobotSurface1 %>%
  filter(set_split == "Test") %>%
  dplyr::select(c(id)) %>%
  distinct() %>%
  pull(id)

# Filter dataset by ten IDs

set.seed(123) # Fix seed for reproducibility

trainSamps <- sample(trainIDs, 5)
testSamps <- sample(testIDs, 5)
allSamps <- append(trainSamps, testSamps)

# Produce graphic

p1 <- SonyAIBORobotSurface1 %>%
  filter(id %in% allSamps) %>%
  ggplot(aes(x = timepoint, y = values, colour = target)) +
  geom_line() +
  labs(x = "Timepoint",
       y = "Value",
       colour = "Class") +
  scale_colour_manual(values = c("#E494D3", "#87DCC0")) +
  facet_wrap(~id, ncol = 2) +
  theme(legend.position = "bottom",
        legend.key = element_blank())

print(p1)
Random subset of ten raw time series for SonyAIBORobotSurface1

Random subset of ten raw time series for SonyAIBORobotSurface1

2.4 Outliers and missing data

No missing data is present in the dataset (see figure below).

naniar::vis_miss(SonyAIBORobotSurface1, warn_large_data = FALSE)
Visualisation of missing data

Visualisation of missing data

The distribution of values (agnostic of timepoint) across both train-test split and class grouping is presented in the figure below. Evidently, value medians across each time series are similar (as indicated by horizontal line through each box), however, there are numerous outliers for each group split. Given the nature of the following analysis presented in this report (details provided in the next section), the presence of outliers and some heavy-tailed distributions of values is not a concern.

p2 <- SonyAIBORobotSurface1 %>%
  mutate(set_split = factor(set_split, levels = c("Train", "Test"))) %>%
  ggplot(aes(x = target, y = values, colour = target)) +
  geom_boxplot() +
  labs(x = "Class",
       y = "Values") +
  scale_colour_manual(values = c("#E494D3", "#87DCC0")) +
  facet_wrap(~set_split) +
  theme(legend.position = "none")

print(p2)
Visualisation of value distributions by group

Visualisation of value distributions by group

3 Research Question 1 - Do time-series features reveal empirical structure in the data?

The raw time series are interesting, however, they do not immediately provide us with informative findings about the differences in movement between carpet and cement, nor the temporal properties themselves. One way to approach this is to compute a series of features for each unique time series, and use the feature space to do analysis. A feature is a single summary statistic that is calculated on a vector of time-series values. Simple examples include a mean and standard deviation, though feature complexity can rise rapidly to quantities such as spectral entropy and GARCH coefficients. Bringing time series data to the feature space both reduces computational intensity and may increase the potential to find rich structure in the data that is not immediately visible from the measurement space.

3.1 Feature calculation

One feature set that has been particularly prominent in the academic literature is catch22 - a set of 22 time-series characteristics that have been shown to be minimally redundant while maximising classification accuracy across a broad range of problems. This feature set is implemented in R through the Rcatch22 package. For convenience, Rcatch22 and a host of other feature sets across both R and Python has been built into the package theft which automates the entire workflow from feature calculation to data visualisation in a format consistent with the broader tidyverse.

feature_matrix <- calculate_features(data = SonyAIBORobotSurface1, 
                                     id_var = "id", 
                                     time_var = "timepoint", 
                                     values_var = "values", 
                                     group_var = "target",
                                     feature_set = "catch22")

# Re-join set split labels

setlabs <- SonyAIBORobotSurface1 %>%
  dplyr::select(c(id, set_split)) %>%
  distinct() %>%
  mutate(id = as.character(id))

featMat <- feature_matrix %>%
  left_join(setlabs, by = c("id" = "id"))

3.2 Feature calculation quality

Any issues with the calculation of features can be understood easily by computing the proportion of value types (good, NaN, Inf/-Inf) by feature. The plot below depicts this. Evidently, no issues with feature computation are evident, with 100% of values being classified as good.

plot_quality_matrix(feature_matrix)

3.3 Low dimension representation

The new 22-dimension feature space can be further simplified by calculating and plotting a low dimension representation to assess if any empirical structure exists in the data. A common method for doing this is principal components analysis (PCA); whose first two components can be easily graphed as a scatterplot. The theft package automates the normalisation of features before the PCA calculation to avoid any issues that may arise with unequal variance across the features. A simple z-score normalisation method was chosen for this analysis as it enables an intuitive interpretation of the resulting values compared to other rescaling options. This plot is presented in the figure below. Evidently, there appears to be clear separation between the classes (carpet or cement) in the low dimension space of the first two principal components. This bodes well for the consideration of classification algorithms.

plot_low_dimension(featMat, 
                   is_normalised = FALSE, 
                   id_var = "id", 
                   group_var = "group", 
                   method = "z-score", 
                   low_dim_method = "PCA", 
                   plot = TRUE)
Low dimension representation of SonyAIBORobotSurface1

Low dimension representation of SonyAIBORobotSurface1

3.4 Matrix visualisations

Further structure in the data can be assessed through pairwise relationships between the feature vectors. These relationships can be plotted as matrices in a gradient-filled format that resembles a heatmap. This type of graphic is a standard output of theft functions, which also handles the computation of Pearson product-moment correlations and the hierarchical clustering of results to visually reveal any structure in the data once plotted in the heatmap. The z-score normalisation method was again applied for both of these graphics. This is presented for a matrix of Feature x Unique Time Series in the figure below.

plot_feature_matrix(featMat, 
                    is_normalised = FALSE, 
                    id_var = "id", 
                    method = "z-score")
Pairwise relationships between features and unique time series

Pairwise relationships between features and unique time series

A similar plot for the feature vectors of Unique Time Series x Unique Time Series is presented in the figure below.

plot_connectivity_matrix(featMat, 
                         is_normalised = FALSE, 
                         id_var = "id", 
                         names_var = "names", 
                         values_var = "values",
                         method = "z-score")
Pairwise relationships between unique time series

Pairwise relationships between unique time series

Evidently, neither matrix visualisation reveals particularly informative structure in the data, especially relative to the clarity of the low dimension plot. While the structure of the matrix visualisations (and the low dimension plot, to an extent) is contingent on the normalisation method selected, producing a comparative analysis and selection of an optimal normalisation method is beyond the scope of this report.

3.5 Summary

Evidently, a feature-based approach to time-series analysis can reveal interesting information about the empirical structure of the data. In particular, the low dimension representation strongly suggests that features could be used to predict class membership (robot walking on carpet or cement). This was visible through the clear separation of the classes in the two-dimensional space. This separation may be even more pronounced across the entire 22 dimensions, though this cannot be easily visualised. Extending this visualisation-based approach to more rigorous computational classification must be performed.

4 Research Question 2 - Can a time series feature-based classifier be trained to accurately predict labels in the test set?

With a strong understanding of the data, attention can be turned to tasks of immediate practical importance. The present report is particularly interested in whether a sample of the feature vectors be used to train a classifier that can accurately predict whether a robot was walking on carpet or cement in unseen data. This task is important, as learning the dynamics and temporal properties of the robot may better enable predictive technology such as risk assessment and damage management, among other uses.

The feature matrix can be wrangled into formats appropriate for a train-test modelling pipeline through the following:

# Separate into train-test splits

train <- featMat %>%
  filter(set_split == "Train") %>%
  mutate(group = as.character(group),
         group = factor(group, levels = c("1", "2")))

test <- featMat %>%
  filter(set_split == "Test") %>%
  mutate(group = as.character(group),
         group = factor(group, levels = c("1", "2")))

# Normalise train data, save its characteristics, and normalise test data onto its same scale

trainNormed <- train %>%
  group_by(names) %>%
  mutate(values = (values-mean(values, na.rm = TRUE))/stats::sd(values, na.rm = TRUE)) %>%
  ungroup()

trainScales <- train %>%
  group_by(names) %>%
  summarise(mean = mean(values, na.rm = TRUE),
            sd = sd(values, na.rm = TRUE)) %>%
  ungroup()

testNormed <- test %>%
  left_join(trainScales, by = c("names" = "names")) %>%
  group_by(names) %>%
  mutate(values = (values-mean)/sd) %>%
  ungroup() %>%
  dplyr::select(-c(mean, sd))

# Widen datasets for modelling

trainWide <- trainNormed %>%
  pivot_wider(id_cols = c(id, group), names_from = "names", values_from = "values") %>%
  drop_na()

testWide <- testNormed %>%
  pivot_wider(id_cols = c(id, group), names_from = "names", values_from = "values") %>%
  drop_na()

The classification algorithm chosen for this report is a Support vector machine (SVM) - a machine learning technique that seeks to fit a hyperplane of support vectors to the data and maximise the distance between the classes relative to the vectors. The classifier can be trained on the train data with the following:

# Dynamically create model formula to save typing out all the features

features <- colnames(trainWide)[3:24]
myformula <- as.formula(paste("group ~ ", paste(features, collapse = "+")))

# Fit classifier to train data

m1 <- svm(formula = myformula, 
             data = trainWide,
             kernel = 'linear')

Predictive accuracy can be assessed by predicting classes on the test set and computing a confusion matrix to compare predictions with the actual class labels. Evidently, the model predicted the labels of the unseen test data with a balanced accuracy of 86%, which is a promising result for such a fast modelling approach without any hyperparameter tuning.

# Predict class labels on test data

ypred <- predict(m1, testWide)

# Produce confusion matrix summary

confusionMatrix(testWide$group, ypred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2
##          1 245  98
##          2   1 257
##                                           
##                Accuracy : 0.8353          
##                  95% CI : (0.8032, 0.8641)
##     No Information Rate : 0.5907          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6788          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9959          
##             Specificity : 0.7239          
##          Pos Pred Value : 0.7143          
##          Neg Pred Value : 0.9961          
##              Prevalence : 0.4093          
##          Detection Rate : 0.4077          
##    Detection Prevalence : 0.5707          
##       Balanced Accuracy : 0.8599          
##                                           
##        'Positive' Class : 1               
## 

4.1 Summary

A feature-based approach to time-series analysis was successfully able to classify unseen data with an accuracy of 86% balanced between the classes. Practically, this means using the classification algorithm trained here to predict carpet or cement in the real world would yield a far more accurate estimate than if a guess was made. This is especially important, given the speed with which the model was trained and tested. Because it was so fast, this technique could be used in almost real-time if the surface underneath the robot was unknown. This may hold import ramifications in different areas of commercial and household robotics in terms of the design of the internal computer or in the remote control of robots in areas with poor visibility where the terrain is unknown.

5 Reflection on Data Wrangling

Data wrangling was crucial to undertaking this project. From wrangling of the initial webscraped datasets from wide time series in a non-native R format into tidy dataframes, to the production of exploratory data visualisations, to the calculation of features and development of a classification algorithm. Data wrangling enabled a question of practical importance to be adequately addressed using a novel approach. This project was fortunate in that no dates needed to be parsed, nor missing values handled, but the data wrangling pipelines presented here no doubt would have accommodated these needs easily if they were required. The tidyverse collection of packages in R greatly enables programmers to perform tedious and difficult wrangling tasks in an intuitive and efficient way, all of which is made straightforward by the simple and consistent API and philosophical underpinnings that each package within it utilises.

One of the most important pieces of the workflow which was made significantly easier with the data wrangling tools in R was the manipulation of dataframes from wide to long (and back to wide). Many time series datasets from outside of R are stored in wide format matrices without column names, which makes automatically plotting and summarising this data using tools such as dplyr and ggplot very difficult. Data wrangling tools such as tidyr enabled us to easily wrangle this data between long and wide, adjust and normalise columns, and perform other manipulations prior to computing summary statistics or producing graphics. Moreover, the final statistical learning model required each variable to be a column, meaning the previously long dataframes appropriate for faceted graphics were unsuitable for modelling. Data wrangling tools easily enabled this to be manipulated back into the correct format.